Gold price predictor

Goal

Recently, emerging world economies such as China, Russia, and India have been big buyers of gold, whereas the USA, South Africa, and Australia are among the big sellers of gold.

Forecasting the rise and fall of daily gold rates can help investors decide when to buy (or sell) the commodity. But gold prices depend on many factors, such as the prices of other precious metals, crude oil prices, stock exchange performance, bond prices, currency exchange rates, etc.

Our goal in this project is to build a model that can predict gold adjusted closing prices, which investors can leverage to make well-informed decisions.

Note: we have very few samples, so we are unable to train a model with very good results; our goal is therefore to demonstrate the methodology.

Metric

We are using two main metrics to evaluate our models: the mean squared error (MSE) and the root mean squared error (RMSE).
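A minimal sketch of how the MSE and RMSE used in the model comparison can be computed with scikit-learn (the price arrays here are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted adjusted closing prices
y_true = np.array([1700.5, 1712.0, 1698.3])
y_pred = np.array([1695.0, 1710.2, 1701.1])

mse = mean_squared_error(y_true, y_pred)  # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error
print(f"MSE={mse:.3f}, RMSE={rmse:.3f}")
```

Both metrics penalize large errors heavily; RMSE has the advantage of being in the same unit as the target (price).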

Dataset

Download dataset.

Data description

TODO: provide list of features here, as well as the description of each one

Loading the dataset and getting a quick overview of it
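A typical pandas overview looks like the sketch below; the inline CSV stands in for the real Kaggle file, and the column names are assumptions:

```python
import io
import pandas as pd

# Tiny inline sample standing in for the downloaded CSV (values are illustrative)
csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2011-12-15,154.74,154.95,151.71,152.33,152.33,21521900\n"
    "2011-12-16,154.31,155.37,153.90,155.23,155.23,18124300\n"
)
df = pd.read_csv(csv, parse_dates=["Date"])

print(df.shape)       # number of samples and features
df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics of the numeric columns
```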

Quick remarks about the information

Quick remarks about isna().sum() and isnull().sum()

Although there are many columns whose results cannot be seen, it can be observed that the results of isna().sum() and isnull().sum() are the same, so we don't have to suspect underlying problems with the data we have.
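This agreement is in fact guaranteed: in pandas, isnull() is simply an alias of isna(), as a minimal check shows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 3.0]})

# isnull() is an alias of isna() in pandas, so the counts always agree
assert df.isna().sum().equals(df.isnull().sum())
print(df.isna().sum())  # one missing value per column here
```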

Dealing with multicolinearity

The plot is too congested to read information from it easily. But:

We are going to drop features whose correlation is higher than 0.66 in absolute value.
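One common way to apply such a threshold is to scan the upper triangle of the correlation matrix and drop one feature from every offending pair; a sketch (the helper name and toy data are illustrative):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.66) -> pd.DataFrame:
    """Drop one feature from every pair whose |correlation| exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Tiny illustration: "b" duplicates "a" (corr = 1), "c" is weakly correlated
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
print(drop_highly_correlated(df).columns.tolist())  # "b" is dropped
```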

Though some more features could be dropped, we are satisfied with things as they are. Now let's deal with the outliers.

Dealing with outliers

There are too many features to make a quick analysis with .describe() alone. But the charts below will give us more information about each feature.

Observation
Nearly half of the samples are considered outliers, and there is not much data left to work with after the outliers are removed. The way the data was collected should be reviewed and tuned to improve the quality of the samples (i.e., produce fewer outliers).
Another way to resolve the issue is to use data augmentation to generate new but realistic data based on the original good samples.
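For tabular data, one simple augmentation strategy is to append jittered copies of the clean samples, adding Gaussian noise scaled by each column's standard deviation; the helper below is a hypothetical sketch, not part of the notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def augment_with_noise(df: pd.DataFrame, n_copies: int = 2, scale: float = 0.01) -> pd.DataFrame:
    """Append jittered copies of the rows: each value receives Gaussian
    noise proportional to its column's standard deviation."""
    frames = [df]
    std = df.std()
    for _ in range(n_copies):
        noisy = df + rng.normal(0.0, scale, size=df.shape) * std.values
        frames.append(noisy)
    return pd.concat(frames, ignore_index=True)

clean = pd.DataFrame({"Adj Close": [120.0, 121.5, 119.8], "Volume": [1e6, 1.2e6, 0.9e6]})
augmented = augment_with_noise(clean)
print(len(augmented))  # 9 rows: the original 3 plus two noisy copies
```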

Since the data come from Kaggle, we will continue the work as is, with the remaining 871 samples.

After the removal of outliers, some features may be left with a single unique value. This generates a LinAlgError: singular matrix when trying to plot them in a box plot with plot_distribution_in_features(), which is expected since, as said above, all the values are identical. Those features shall therefore be dropped immediately.
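Such constant columns can be detected with nunique(); a sketch (helper name and toy frame are illustrative):

```python
import pandas as pd

def drop_constant_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns holding a single unique value, which make distribution
    plots degenerate (LinAlgError: singular matrix in KDE estimation)."""
    constant = [col for col in df.columns if df[col].nunique() <= 1]
    return df.drop(columns=constant)

df = pd.DataFrame({"Adj Close": [120.0, 121.5, 119.8], "RHO_PRICE": [600, 600, 600]})
print(drop_constant_features(df).columns.tolist())  # ['Adj Close']
```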

We can see that we no longer have features with a single unique value.

Manual removal of outliers in Adj Close

Since this is our output variable, we will remove its outliers.
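A common removal rule, sketched below, keeps only the rows falling within the Tukey fences (1.5 × IQR beyond the quartiles); the helper name and the toy frame are illustrative:

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose value in `column` lies within k*IQR of the quartiles."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

df = pd.DataFrame({"Adj Close": [120.0, 121.5, 119.8, 122.3, 500.0]})
print(len(remove_outliers_iqr(df, "Adj Close")))  # the 500.0 row is removed
```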

Checking outliers presence in all features
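Since this check is repeated after every removal step, it can be factored into one helper that counts, per numeric feature, how many values fall outside the 1.5 × IQR fences (a sketch with illustrative data):

```python
import pandas as pd

def count_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside the k*IQR fences."""
    counts = {}
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = ~df[col].between(q1 - k * iqr, q3 + k * iqr)
        counts[col] = int(mask.sum())
    return pd.Series(counts)

df = pd.DataFrame({"Volume": [1.0, 1.1, 0.9, 1.2, 50.0], "Adj Close": [10, 11, 10, 12, 11]})
print(count_outliers(df))  # one extreme Volume value, no Adj Close outliers
```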

Manual removal of outliers in Volume

Checking outliers presence in all features

Manual removal of outliers in SP_volume

Although there are still two outliers, we will leave them as they are since we don't have much data.

Checking outliers presence in all features

Manual removal of outliers in EG_volume

Checking outliers presence in all features

Manual removal of outliers in RHO_PRICE

Checking outliers presence in all features

Manual removal of outliers in USB_Price

Checking outliers presence in all features

Manual removal of outliers in GDX_volume

Checking outliers presence in all features

Removing outliers in one feature tends to create new outliers in other features. Since we have removed nearly all outliers from the output variable and most outliers from the input variables, we are going to proceed to the next stage of the work.

EDA

TODO: perform EDA on the available data after outliers removal

Pre-processing data

Now we will fit the column transformer to the training data.
🔑 Whenever we have a column transformer, we need to fit it on the training data and then use that fitted column transformer to transform the test data.

Now we will take what was learned from the training data by the column transformer to transform the training and testing data.
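The fit-on-train, transform-both pattern looks like the sketch below; the frame, target, and column names are stand-ins for the notebook's data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature frame and target
df = pd.DataFrame({"Open": [1.0, 2.0, 3.0, 4.0], "Volume": [10.0, 20.0, 30.0, 40.0]})
y = pd.Series([1.5, 2.5, 3.5, 4.5])

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=0)

ct = ColumnTransformer([("scale", StandardScaler(), ["Open", "Volume"])])
ct.fit(X_train)                    # learn means/stds from the training split only
X_train_t = ct.transform(X_train)  # apply what was learned to the training data
X_test_t = ct.transform(X_test)    # ...and to the test data, avoiding leakage
print(X_train_t.shape, X_test_t.shape)
```

Fitting on the full dataset instead would leak test-set statistics into the preprocessing step and inflate the evaluation.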

Experimenting on ML models

We are going to experiment on different ML algorithms in order to find the best performing one.
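A typical comparison loop fits each candidate and reports its test-set MSE and RMSE; the synthetic data and the exact model list below are assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed gold dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE={mse:.2f}, RMSE={np.sqrt(mse):.2f}")
```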

We can see from the RMSEs that the trained models are not doing a very good job at predicting; however, this is to be expected since we don't feed them enough data.

The Lasso algorithm gives us the best results with respect to MSE and RMSE, so we are going to use it.
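Retraining the chosen Lasso on the whole dataset is then a one-liner; the synthetic data here stands in for the full preprocessed samples:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Stand-in data; in the notebook this would be the full preprocessed dataset
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=1)

lasso = Lasso(alpha=1.0)  # alpha controls the strength of the L1 penalty
lasso.fit(X, y)           # retrain on the whole dataset
print(lasso.coef_)        # the L1 penalty can drive some coefficients to exactly 0
```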

Having trained on the whole dataset, we get much better results, which is surprising since we didn't use many samples. This could be explained by the fact that the gold price doesn't fluctuate much, but this remains to be confirmed by EDA.